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1 Introduction 


The goal of the TREC Web track over the past few years has been to explore 
and evaluate innovative retrieval approaches over large-scale subsets of the 
Web — currently using ClueWeb12, on the order of one billion pages. For 
TREC 2014, the sixth year of the Web track, we implemented the follow- 
ing significant updates compared to 2013. First, the risk-sensitive retrieval 
task was modified to assess the ability of systems to adaptively perform 
risk-sensitive retrieval against multiple baselines, including an optional self- 
provided baseline. In general, the risk-sensitive task explores the tradeoffs 
that systems can achieve between effectiveness (overall gains across queries) 
and robustness (minimizing the probability of significant failure, relative to 
a particular provided baseline). Second, we added query performance pre- 
diction as an optional aspect of the risk-sensitive task. The Adhoc task 
continued as for TREC 2013, evaluated using both adhoc and diversity rel- 
evance criteria. 

'This year, experiments by participating groups again used the ClueWeb12 
Web collection, a successor to the ClueWeb09 dataset that comprises about 
one billion Web pages crawled between Feb-May 2012.! The crawling and 
collection process for ClueWeb12 included a rich set of seed URLs based on 
commercial search traffic, T'witter and other sources, and multiple measures 
for flagging undesirable content such as spam, pornography, and malware. 

For consistency with last year's Web track, topic development was done 
using a very similar process to the one used in 2013. A common topic set 





‘Details on ClueWeb12 are available at http://boston.lti.cs.cmu.edu/cluewebi2 


of 50 additional new topics was developed and used for both the Adhoc 
and Risk-sensitive tasks. In keeping with the goal of reflecting authentic 
Web retrieval problems, the Web track topics were again developed from a 
pool of candidate topics based on the logs and data resources of commercial 
search engines. The initial set of candidates developed for the 2013 track 
was large enough that candidate topics not used in 2013 were used as the 
pool for the 2014 track. We kept the distinction between faceted topics, 
and unfaceted (single-facet) topics. Faceted topics were more like "head" 
queries, and structured as having a representative set of subtopics, with 
each subtopic corresponding to a popular subintent of the main topic. The 
faceted topic queries had subintents that were likely to be most relevant to 
users. Unfaceted (single-facet) topics were intended to be more like “tail” 
queries with a clear question or intent. For faceted topics, query clusters 
were developed and used by NIST for topic development. Only the base 
query was released to participants initially: the topic structures containing 
subtopics and single- vs multi-faceted vs. topic type were only released after 
runs were submitted. This was done to avoid biases that might be caused 
by revealing extra information about the information need that may not be 
available to Web search systems as part of the actual retrieval process. 

'The Adhoc task judged documents with respect to the topic as a whole. 
Relevance levels are similar to the levels used in commercial Web search, 
including a spam/junk level. The top two levels of the assessment structure 
are related to the older Web track tasks of homepage finding and topic 
distillation. Subtopic assessment was also performed for the faceted topics, 
as described further in Section 3. 

'Table 1 summarizes participation in the TREC 2014 Web Track. Overall, 
we received 42 runs from 9 groups: 30 ad hoc runs and 12 risk-sensitive runs. 
The number of participants in the Web track decreased over 2013 (when 15 
groups participated, submitting 61 runs). Seven runs were categorized as 
manual runs (4 adhoc, 3 risk), submitted across 2 groups: all other runs 
were automatic with no human intervention. All submitted runs used the 
main Category A corpus: none used the Category B subset of ClueWeb12. 

'The submitting groups were: 


Task | Adhoc | Risk | Total 
Groups 9 5 9 





Runs 30 12 42 





Table 1: TREC 2014 Web Track participation. 


Carnegie Mellon University and Ohio State University 
Chinese Academy of Sciences 

Delft University of Technology 

Medical Informatics Laboratory 

University of Delaware (Carterette) 

University of Delaware (Fang) 

University of Glasgow (Terrier Team) 

University of Massachusetts Amherst 

University of Twente 


Three teams submitted at least one run with an associated Query Per- 
formance Prediction file. 

In the following, we recap on the corpus (Section 2), and topics (Sec- 
tion 3) used for TREC 2014. Section 4 details the pooling and evaluation 
methodologies applied for Adhoc and Risk-Sensitive tasks, as well as the 
results of the participating groups. Section 5 examines sources of varia- 
tion across submitted runs using Principal Components Analysis. Section 6 
details the efforts of participants on the query performance sub-task. Con- 
cluding remarks follow in Section 7. 


2 ClueWeb12 Category A and B corpus 


As with ClueWeb09, the ClueWeb12 corpus comes with two datasets: Cat- 
egory A, and Category B. The Category A dataset is the main corpus and 
contains about 733 million documents (27.3 TB uncompressed, 5.54 TB 
compressed). The Category B dataset is a sample from Category A, con- 
taining about 52 million documents, or about 7% of the Category A total. 
Details on how the Category A and B corpora were created may be found 
on the Lemur project website”. We strongly encouraged participants to use 
the full Category A data set if possible. All of the results in this overview 
paper are labeled by their corpus category. 





“http: //lemurproject .org/clueweb12/specs. php 


3 Topics 


NIST created and assessed 50 new topics for the TREC 2014 Web track. As 
with TREC 2013, the TREC 2014 Web track included a significant propor- 
tion of more focused topics designed to represent more specific, less frequent, 
possibly more difficult queries. To retain the Web flavor of queries in this 
track, we kept the notion that some topics may be multi-faceted, i.e. broader 
in intent and thus structured as a representative set of subtopics, each re- 
lated to a different potential aspect of user need. Examples are provided 
below. For topics with multiple subtopics, documents were judged with re- 
spect to each of the subtopics. For each subtopic, NIST assessors made a 
six-point judgment scale as to whether or not the document satisfied the 
information need associated with the subtopic. For those topics with mul- 
tiple subtopics, the set of subtopics was intended to be representative, not 
exhaustive. 

Subtopics were based on information extracted from the logs of a com- 
mercial search engine, based on a pool of remaining topic candidates created 
but not sampled for the 2013 Web track. Topics having multiple subtopics 
had subtopics selected roughly by overall popularity, which was achieved 
using combined query suggestion and completion data from two commercial 
search engines. In this way, the focus was retained on a balanced set of pop- 
ular subtopics, while limiting the occurrence of strange and unusual inter- 
pretations of subtopic aspects. Single-facet topic candidates were developed 
based on queries extracted from search log data that were low-frequency 
(“tail-like”) but issued by multiple users; less than 10 terms in length; and 
relatively low effectiveness scores across multiple commercial search engines 
(as of January 2013). 

'The topic structure was similar to that used for the T'REC 2009 topics. 
An example of a single-facet topic: 


«topic number="293" type="single"> 

<query>educational advantages of social networking sites</query> 
<description> 

What are the educational benefits of social networking sites? 
</description> 

</topic> 


An example of a faceted topic: 


<topic number="289" type="faceted"> 


<guery>benefits of yoga</guery> 

<description>What are the benefits of yoga for kids?</description> 

<subtopic number="1" type="inf">What are the benefits of yoga for kids?</subtopic> 
<subtopic number-"2" type="inf">Find information on yoga for seniors.</subtopic> 

<subtopic number-"3" type="inf">Does yoga help with weight loss?</subtopic> 

<subtopic number="4" type="inf">What are the benefits of various yoga poses?</subtopic> 
<subtopic number-"5" type="inf">What are the benefits of yoga during pregnancy?</subtopic> 
<subtopic number-"6" type="inf">How does yoga benefit runners?</subtopic> 

<subtopic number-"7" type="inf">Find the benefits of yoga nidra.</subtopic> 

</topic> 


The initial release of topics to participants included only the guery field, as 
shown in the excerpt here: 


251:identifying spider bites 
252:history of orcas island 
253:tooth abscess 
254:barrett's esophagus 
255:teddy bears 


As shown in the above examples, those topics with a clear focused intent 
have a single subtopic. Topics with multiple subtopics reflect underspec- 
ified gueries, with different aspects covered by the subtopics. We assume 
that a user interested in one aspect may still be interested in others. Each 
subtopic was informally categorized by NIST as being either navigational 
("nav") or informational (“inf”). A navigational subtopic usually has only a 
small number of relevant pages (often one). For these subtopics, we assume 
the user is seeking à page with a specific URL, such as an organization's 
homepage. On the other hand, an informational query may have a large 
number of relevant pages. For these subtopics, we assume the user is seek- 
ing information without regard to its source, provided that the source is 
reliable. 

For the adhoc task, relevance is judged on the basis of the description 
field. T'hus, the first subtopic is always identical to the description sentence. 


4 Methodology and Measures 


4.1 Pooling and Judging 


For each topic, participants in the adhoc and risk-sensitive tasks submitted 
a ranking of the top 10,000 results for that topic. All submitted runs were 
included in the pool for judging (with the exception of 2 runs from 1 group 
that were marked as lowest judging priority and exceeded the per-team task 
limit in the guidelines). A common pool was created from the runs submitted 
to both tasks, which were pooled to rank depth 25. 

For the risk-sensitive task, versions of ndeval and gdeval supporting 
the risk-sensitive versions of the evaluation measures (described below) were 
provided to NIST. 'These versions were identical to those used in last year's 
track except for a minor adjustment in output formatting. 

All data and tools required for evaluation, including the scoring programs 
ndeval and gdeval as well as the baseline runs used in computation of the 
risk-sensitive scores are available in the track’s github distribution?. 

The relevance judgment for a page was one of a range of values as de- 
scribed in Section 4.2. All topic-aspect combinations this year had a non- 
zero number of known relevant documents in the ClueWeb12 corpus. For 
topics that had a single aspect in the original topics file, that one aspect is 
used. For all other topics, aspect number 1 is the single aspect. All topics 
were judged to depth 25. 


4.2 Adhoc Retrieval Task 


An adhoc task in TREC provides the basis for evaluating systems that 
search a static set of documents using previously-unseen topics. The goal 
of an adhoc task is to return a ranking of the documents in the collection 
in order of decreasing probability of relevance. The probability of relevance 
for a document is considered independently of other documents that appear 
before it in the result list. For the adhoc task, documents are judged on the 
basis of the description field using a six-point scale, defined as follows: 


1. Nav: This page represents a home page of an entity directly named 
by the query; the user may be searching for this specific page or site. 
(relevance grade 4) 


2. Key: This page or site is dedicated to the topic; authoritative and 
comprehensive, it is worthy of being a top result in a web search engine. 





3nttp: //github.com/trec-web/trec-web- 2014 


(relevance grade 3) 


3. HRel: The content of this page provides substantial information on 
the topic. (relevance grade 2) 


4. Rel: The content of this page provides some information on the topic, 
which may be minimal; the relevant information must be on that page, 
not just promising-looking anchor text pointing to a possibly useful 
page. (relevance grade 1) 


5. Non: The content of this page does not provide useful information on 
the topic, but may provide useful information on other topics, includ- 
ing other interpretations of the same query. (relevance grade 0) 


6. Junk: This page does not appear to be useful for any reasonable pur- 
pose; it may be spam or junk (relevance grade -2). 


After each description we list the relevance grade assigned to that level 
as they appear in the judgment (qrels) file. These relevance grades are also 
used for calculating graded effectiveness measures, except that a value of -2 
is treated as 0 for this purpose. 

'The primary effectiveness measure for the adhoc task was intent-aware 
expected reciprocal rank (ERR-IA) which is a diversity-based variant of ERR 
as defined by Chapelle et al. [1] that accounts for faceted topics. For 
single-facet topics, ERR-IA simply becomes ERR. We also report an intent- 
aware version of nDCG, a-nDCG [3], and novelty- and rank-biased precision 
(NRBP) [2]. Table 2 presents the (diversity-aware) performance of the par- 
ticipating groups in the Adhoc task, ranked by ERR-IA@20 and selecting 
each group's highest performing run among those they submitted to the 
Adhoc task. The applied measures, ERR-IAG20, a-nDCG@20, and NRBP, 
take into account the multiple possible subintents underlying a given topic, 
and hence measure if the participants systems would have performed effec- 
tive retrieval for such multi-faceted queries. Of note, the highest performing 
run was a manual run. Moreover, while category B runs were permitted, no 
participanting groups chose to submit category B runs. 

We also report the standard (non-diversity-based) ERR@20 and nDCG 20 
effectiveness measures for the Adhoc task in Table 3. We note that these 
rankings exhibit some differences from Table 2, demonstrating that some 
systems may focus upon single dominant interpretations of a query, without 
trying to uncover other possible interpretations. 

Finally, Figure 1 visualizes the per-topic variability in ERR@20 across 
all submitted runs. For many topics, there was relatively little difference 


Table 2: Top ad-hoc task results (diversity-based measures), ordered by ERR- 
IAQ20. Only the best automatic run according to ERR-IAQ20 from each group is 
included in the ranking. Only one team submitted a manual run that outperformed 
automatic — the highest manual run from that team (udel fang) is included as well. 





Group Run Cat Type ERR-IAG20 o-nDCGQ20 NRBP 
udel fang UDInfoWebLES A manual 0.688 0.754 0.656 
udel fang UDInfoWebAX A auto 0.608 0.694 0.564 
uogTr uogIrDwl A auto 0.595 0.682 0.548 
BUW webisWtl4axMax A auto 0.589 0.667 0.550 
udel udelCombCAT2 A auto 0.583 0.656 0.545 
wistud wistud.runB A auto 0.583 0.660 0.543 
ICTNET ICTNET14ADR3 A auto 0.580 0.652 0.541 
Group.Xu Terran A auto 0.578 0.647 0.541 
UMASS CIR  CiirAlll A auto 0.558 0.639 0.512 
Organizersl Terrier Base A auto 0.542 0.627 0.501 
ut utexact A auto 0.535 0.612 0.494 
SNU Medinfo SNUMedinfo12 A auto 0.531 0.624 0.481 
Organizers2 IndriBase A auto 0.513 0.585 0.474 





between the top runs and the median, according to some of the effectiveness 
measures (e.g. ERR@20 and some diversity measures). As a result, a small 
number of topics tended to contribute to most of the variability observed 
between systems. In particular, topics 298, 273, 253, 293, 269 had especially 
high variability across systems. In comparing the Indri and Terrier baselines 
used for risk-sensitive evaluation: absolute difference in ERRG20 between 
the baselines was greater than 0.10 for 17 topics, and greater than 0.20 for 
7 topics. Expressed as a relative percentage gain/loss, there were 18 topics 
for which the Terrier baseline ERR@20 was at least 50% higher than the 
Indri baseline, and 6 topics where the Indri baseline was at least 50% higher 
than the Terrier baseline. 


4.3 Risk-sensitive Retrieval Task 


The risk-sensitive retrieval task for Web evaluation rewards algorithms that 
not only achieve improvements in average effectiveness across topics (as in 
the adhoc task), but also maintain good robustness, which we define as 
minimizing the risk of significant failure relative to a given baseline. 
Search engines use increasingly sophisticated stages of retrieval in their 
quest to improve result quality: from personalized and contextual re-ranking 
to automatic query reformulation. These algorithms aim to increase retrieval 















































































































































1.0 4 


0.8 4 


6 
4 
2 
0.0 — 


OZONNA 


46€ 
6S¢ 
66€ 
89€ 
€8¢ 
coc 
88€ 
eSC 
99c 
ESZ 
09€ 
9S¢ 
4S€ 
86€ 
vec 
LSZ 
96€ 
c6€ 
S8€ 
v8c 
ELE 
9€ 
49€ 
48€ 
86€ 


Topic 


(a) Top 25 topics (by descending Terrier baseline ERR@20) 








ot---[[ ] - 68z 

o d F gzz 

pe {£ reg 
ooa] + SSc 

o E) vz 


o oo o HIT] E Zig 












































1.0 4 





o Flip 19% 

E--d p + tez 

oo t ozz 

oo +--[]} + 00€ 

o off - izz 

o r-| [IH p saz 

o r-[H rez 

o co +-f{ [}-- - 082 

r--1 [Lu F voz 

--[ | F 182 

H--[ [H-4 + sez 

o HII--4 - eec 

oo}--{ |] + sec 

HT + oez 

œw- le  - psz 

ETT F 6z 

(I--9 E zez 

+---[] }--4 H sez 

Sp I | )-- | cez 

| 

x Qq 2 
o o o 


OZONNI 


Topic 


(b) Bottom 25 topics (by descending Terrier baseline ERR@20) 


Boxplots for TREC 2014 Web topics, showing variation in 


Figure 1: 


Topics are sorted by de- 
creasing Terrier baseline ERR (220 (bite bar) with Indri baseline also shown 


ERR@20 effectiveness across all submitted runs. 
(light pink bar). 


Table 3: Top ad-hoc task results (non-diversity-based) ordered by ERR@20. Only 
the best automatic run according to ERR@20 from each group is included in the 
ranking. Only one team submitted a manual run that outperformed automatic — 
the highest manual run from that team (udel fang) is included as well. 





Group Run Cat Type ERRG20 nDCG@20 
udel fang UDInfoWebRiskTR | A manual 0.233 0.325 
ICTNET ICTNETI4ADRI A auto 0.208 0.261 
udel fang UDInfoWebAX A auto 0.207 0.307 
Group.Xu Terran A auto 0.204 0.294 
uog Tr uogTrDwl A auto 0.195 0.324 
Organizersl Terrier Base A auto 0.189 0.260 
udel udelCombCAT2 A auto 0.179 0.261 
SNUMedinfo SNUMedinfol2 A auto 0.176 0.270 
BUW webisWt14axAll A auto 0.174 0.258 
wistud wistud.runB A auto 0.174 0.291 
ut utexact A auto 0.172 0.226 
Organizers2 IndriBase A auto 0.153 0.243 
UMASS CIR  CiirAlll A auto 0.153 0.250 





effectiveness on average across queries, compared to a baseline ranking that 
does not use such operations. However, these operations are also risky since 
they carry the possibility of failure — that is, making the results worse than if 
they had not been used at all. The goal of the risk-sensitive task is two-fold: 
1) To encourage research on algorithms that go beyond just optimizing av- 
erage effectiveness in order to effectively optimize both effectiveness and ro- 
bustness, and achieve effective tradeoffs between these two competing goals; 
and 2) to explore effective risk-aware evaluation criteria for such systems. 
'The risk-sensitive retrieval track is related to the goals of the earlier 
TREC Robust Track (TREC 2004, 2005),4 which focused on increasing re- 
trieval effectiveness for poorly-performing topics using evaluation measures 
such as geometric MAP that focused on maximizing the average improve- 
ment on the most difficult topics. The risk-sensitive retrieval track can be 
thought of as a next step in exploring more general retrieval objectives and 
evaluation measures that (a) explicitly account for, and can differentiate 
systems based on, differences in variance or other risk-related statistics of 
the win/loss distribution across topics for a single run, (b) the quality of the 





^http://trec.nist.gov/data/robust/04.guidelines.html 
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curve derived from a set of tradeoffs between effectiveness and robustness 
achievable by systems, measured across multiple runs at different average 
effectiveness levels, and (c) computing (a) and (b) by accounting for the effec- 
tiveness of a competing baseline (both standard, and participant-supplied) 
as a factor in optimizing retrieval performance. 

Two runs were provided as standard baselines, derived from the Indri? 
and Terrier? retrieval engines. For Indri, we used a pseudo-relevance feed- 
back run as implemented by the Indri retrieval engine. Specifically, for 
each query, we used 10 feedback documents, 20 feedback terms, and a lin- 
ear interpolation weight of 0.60 with the original query. For the Terrier 
standard baseline, documents were ranked using the DPH weighting model 
from the Divergence from Randomness framework. For both systems, we 
provided baselines with and without application of the Waterloo spam clas- 
sifier, where the filtered runs removed all documents with a percentile-score 
less than 70 from the rankings’. 

As with the adhoc task, we use Intent-Aware Expected Reciprocal Rank 
(ERR-IA) as the basic measure of retrieval effectiveness, and per-query re- 
trieval delta is defined as the absolute difference in effectiveness between a 
contributed run and the above standard baseline run, for a given query. A 
positive delta means a win for the system on that query, and negative delta 
means a loss. For single runs, the following will be the main risk-sensitive 
evaluation measure. Let A(g) = Ra(q) — RpaAsg(q) be the absolute win 
or loss for query q with system retrieval effectiveness R4(g) relative to the 
baseline’s effectiveness Rgasp(g) for the same query. We categorize the 
outcome for each query q in the set Q of all N queries according to the sign 
of A(q), giving three categories: 

Hurt Queries (Q_) have A(g) < 0; Unchanged Queries (Qo) have A(g) = 0; 
Improved Queries (Q+) have A(q) > 0. 

The risk-sensitive utility measure Unrsk(Q) of a system over the set of 
queries Q is defined as: 


Unisk(Q) =1/N - [ZgeQ, Alq) — (a + 1)2ge0- A(a)] (1) 


where a is the risk-aversion parameter. In words, this rewards systems 
that maximize average effectiveness, but also penalizes losses relative to the 
baseline results for the same query, weighting losses a + 1 times as heavily 
as successes. For example, when the risk aversion parameter a is large, this 





Šnttp://www.lemurproject.org/indri/ 
Šhttp://terrier.org 
"http: //www.mansci.uwaterloo.ca/-msmucker/cw12spam/ 
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rewards systems that are more conservative and able to avoid large losses 
relative to the baseline. T'he adhoc task objective, maximizing only average 
effectiveness across queries, corresponds to the special case a — 0. 

Table 5 summarizes the results for all runs submitted for the risk-sensitive 
retrieval task, according to the Unrsg evaluation measure with a = 5, using 
each of the Indri and Terrier baselines. The average Uprsx across both base- 
lines is also reported. Notably, the best average and Terrier-relative URrsK 
was achieved by manual runs (in the top 3 places). However, the best Indri- 
relative Uprsy was achieved by an automatic run. Also, the relative ranking 
of runs by Uprsg changes considerably with the choice of baseline. This may 
indicate that some systems were tuned to optimize UpRrsx for one baseline 
but not the other — or for no baseline at all. 

In last year's risk-sensitive task, participants were asked to submit a 
set of runs that were optimized for different levels of risk aversion, e.g. by 
optimizing for Uprs,x using different pre-specified values of o. However, 
not enough groups attempted this to allow for meaningful analysis. As a 
result, we adjusted the task this year so that participants were requested to 
submit risk-sensitive runs each of which was optimized against one of two 
different baselines. Participants also had the opportunity to include their 
own self-defined baseline runs. The goal was to reward systems that could 
successfully adapt to multiple baselines. Table 4 gives the complete set of 
Uprsr results (a = 5) for the official and self-defined baselines. 


5 Analysis of variation in effectiveness across runs 


Using Principal Component Analysis (PCA) on topic retrieval effectiveness 
scores across all submitted runs can help identify underlying factors that 
account for variation across systems. Figure 2 shows a biplot based on PCA 
of standardized retrieval effectiveness scores (ERR@20 and nDCG@20) for 
all 2014 topics and runs. Runs are plotted as circles, with selected runs 
labeled with TREC run name. Topics are plotted as plus signs. (Details on 
PCA plots for IR can be found in Dinçer [5].) 

Topics and runs near the origin have a mean effectiveness score close 
to the group mean. Topics and runs far from the origin along one or both 
component dimensions are those that have more influence on the group 
ranking (based on average ERR@20 or nDCG@20 respectively). Topics 
that are close together in the plot are those with similar effectiveness profiles 
across runs. Likewise, runs that are close together had similar effectiveness 
profiles across topics. 
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Figure 2: Principal components biplot of TREC 2014 Web Track runs and 
topics, based on ERRG20 (top) and nDCG@20 (bottom). 


More than 80% of the overall variation in retrieval effectiveness was 
captured by a single (first) principal component, for both ERRG20 (80.9%) 
and nDCG@20 (86.6%). A subset of about 10 topics had large coordinate 
values (greater than 1) associated with this first principal component, and 
thus had a large effect on the overall ranking of systems by average retrieval 
effectiveness. Examples of such topics include topic 267 (feliz navidad lyrics) 
for ERR@20 and topic 297 (altitude sickness) for NDCG@20. (These results 
are in accord with those of Figure 1 that visualizes per-topic variability 
across runs.) 

The second principal component accounted for much less remaining vari- 
ation for both ERR@20 (6.1%) and nDCG@20 (2.8%). A set of 3-5 topics 
(ERRG20 vs nDCGG20) were strongly associated with the second compo- 
nent. These included topic 269 (marshall county schools) and topic 298 
(medical care and jehovah's witnesses). We can see from the biplot that 
some runs (e.g. UDInfoWebRiskTR) did well on the first-component topics 
(and these were typically highly-ranked overall) but less so on the second- 
component topics, while other runs (e.g. utexact and ICTNET14ADR1) did 
well on the second-component topics. 

The separation between first- and second-component runs could be due 
to different choices of text representation or feature weighting. For exam- 
ple, similarities between the runs that did well on second-component topics 
(utexact and ICTNET14ADR1) might be due to their focus on anchor text. 
Better understanding of these factors is a topic for future investigation. 


6 Query Performance Prediction 


Determining the effectiveness of a retrieval, be it a baseline or a treatment, is 
one of the fundamental subtasks in risk-aware ranking. As such, in addition 
to the core document ranking tasks, the web track included a query per- 
formance prediction subtask, where participants were asked to rank topics 
according to the predicted performance of the baseline and the treatment. 
The format of the evaluation followed the TREC 2004 Robust Retrieval 
Track with minor variations [7]. Participants were asked to output a score for 
each topic: (a) a prediction score for the absolute effectiveness of the results 
for baseline used for risk-sensitive run, (b) a prediction score for the absolute 
effectiveness of the results for the risk-sensitive run, and (c) a relative gain or 
loss prediction score (the difference in effectiveness between the risk-sensitive 
run and the baseline run). We evaluated the ability of participants to predict 
the rank ordering of per-topic performance and, therefore, measured the 
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Kendall's 7 correlation between the system performance prediction scores 
and the actual topic performance in terms of ERRG20. 

We present summary results in Table 6 and scatterplots of per-topic 
prediction scores versus ERR@20 in Figures 3, 4, and 5. 

We found that participant performance predictors were weakly correlated 
with the actual topic performance. On an absolute scale, the values of 
Kendall's 7 are low, with an mean 7 of < 0.1 across all prediction tasks. 
The Kendall's 7 values tend to lie between 0.1 and 0.5 for web topics [6, 
Table 4] and 0.3 and 0.5 for non-web tasks [4, Table 1], suggesting that the 
2014 topics may be more confusing to systems than previous collections. 

For both baseline and risk runs in isolation, participant predictions were 
more weakly correlated (Thaseline = 0.04, Triskrun = 0.03) compared to predic- 
tions of relative improvements (Trelative = 0.07). This observation suggests 
that detecting within-topic performance may be easier and, as a result, sup- 
ports risk-sensitive retrieval as a research direction. 


7 Conclusions 


This is the last year for the Web track at TREC in its current form. Over the 
past 6 years, the Web track has developed resources including 300 topics with 
associated relevance judgments, across two separate corpora: ClueWeb09 
and ClueWeb12. 

Particular areas that we believe that have benefited from the TREC 
Web track include approaches for learning-to-rank, diversification and the 
efficiency /effectiveness tradeoff, as well as risk-sensitive models and evalua- 
tion. In particular, we believe that there is much further scope for promising 
research in the area of risk-sensitive retrieval, building upon the resources 
created by the TREC Web track. 
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Figure 3: Baseline Performance Prediction Results 
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Figure 4: Risk Run Performance Prediction Results 
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Figure 5: Relative Performance Prediction Results 
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Table 4: Risk-sensitive effectiveness of submitted risk runs relative to official and 
self-defined baselines. 































































































Run Baseline URrisk, 455 
CTNETIARSRI indri -0.01702 
CTNETI4RSR1 terrier -0.25635 
CTNETI4RSR1 CTNET14ADR1 -0.32634 
CTNETI4RSR1 CTNETI4ADR2 -0.31050 
CTNETI4RSR1 CTNETI4ADR3 -0.32254 
CTNETI4RSR2 indri -0.11972 
CTNETI4RSR2 terrier -0.18145 
CTNETI4RSR2 CTNET14ADR1 -0.32083 
CTNET14RSR2 CTNETI4ADR2 -0.30409 
CTNETI4RSR2 CTNETI4ADR3 -0.31601 
CTNETI4RSR3 indri -0.15928 
CTNETI4RSR3 terrier -0.20986 
CTNETI4RSR3 CTNET14ADR1 -0.02661 
CTNETI4RSR3 CTNETI4ADR2 -0.01318 
CTNETI4RSR3 CTNETI4ADR3 -0.01554 
udelCombCAT2 indri -0.11092 
udelCombCAT2 terrier -0.27713 
udelCombCAT2 udel.itu -0.37582 
udelCombCAT2 udel.itub -0.32152 
UDInfoWebRiskAX indri -0.0949'7 
UDInfoWebRiskAX terrier -0.13282 
UDInfoWebRiskAX UDInfoWebAX -0.09389 
UDInfoWebRiskAX UDInfoWebENT -0.09006 
UDInfoWebRiskAX UDInfoWebLES -0.06893 
UDInfoWebRiskRM indri -0.07929 
UDInfoWebRiskRM terrier -0.12730 
UDInfoWebRiskRM UDInfoWebAX -0.09723 
UDInfoWebRiskRM UDInfoWebENT -0.08083 
UDInfoWebRiskRM UDInfoWebLES -0.07408 
UDInfoWebRiskTR indri -0.05661 
UDInfoWebRiskTR terrier -0.10552 
UDInfoWebRiskTR UDInfoWebAX -0.08702 
UDInfoWebRiskTR UDInfoWebENT -0.07484 
UDInfoWebRiskTR UDInfoWebLES -0.06485 
uog TrBwf indri -0.13225 
uogTrBwf terrier -0.22992 
uogTrBwf uogTrDuax -0.21253 
uogTrBwf uogTrDw -0.26402 
uogTrBwf uogTrIwa -0.12952 
uog TrDwsts indri -0.12092 
uogTrDwsts terrier -0.26885 
uogTrDwsts uogTrDuax -0.26911 
uogTrDwsts uogTrDw -0.27401 
uogTrDwsts uogTrIwa -0.20727 
uogTrgl indri -0.12489 
uogTrgl terrier -0.22741 
uogTrgl uogTrDuax -0.19293 
uogTrgl uogTrDw -0.22614 
uogTrgl uogTrIwa -0.18079 
wistud.runD indri -0.15582 
wistud.runD terrier -0.23495 
wistud.runD wistud.runA -0.04875 
wistud.runD wistud.runB -0.18761 
wistud.runD wistud.runC -0.18761 
wistud.runE indri -0.17293 
wistud.runE terrier -0.35354 
wistud.runE wistud.runA -0.21114 
wistud.runE wistud.runB -0.19061 
wistud.runE wistud.runC -0.19061 
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Table 5: Risk-sensitive effectiveness of submitted risk runs, sorted by average 
descending Uprs, across baselines with a = 5. 





Run Type Indri Terrier Average 
Urisk Urisx URisK 
UDInfoWebRiskTR | manual -0.057 -0.106 -0.081 
UDInfoWebRiskRM | manual -0.079 -0.127 -0.103 
UDInfoWebRiskAX manual -0.095 -0.133 -0.114 
ICTNET14RSR1 automatic -0.017 -0.256 -0.137 
ICTNET14RSR2 automatic -0.120 -0.182 -0.150 
uogTrq1 automatic -0.125 -0.227 -0.176 
uogTrBwf automatic -0.132 -0.230 -0.181 
ICTNET14RSR3 automatic -0.159 -0.210 -0.185 
udelCombCAT2 automatic -0.111 -0.277 -0.194 
uogTrDwsts automatic -0.121 -0.270 -0.195 
wistud.runD automatic -0.156 -0.235 -0.195 
wistud.runE automatic -0.173 -0.354 -0.263 





baseline riskrun relative 
ICTNET14RSR1 -0.0200 0.0114 0.2169 
ICTNETIARSR2 0.0196 0.0311 -0.0311 





uog IrBwf 0.08341 0.1743 0.1927 
uog TrDwsts 0.1744 -0.0409 -0.0588 
wistud.runD 0.0233 -0.0310 0.0425 
wistud.runE 0.0233 . 0.0230 - 


Table 6: Kendall’s 7 between system predicted value of per-topic ERR 20 
and observed ERR 20 for baseline and risk runs. The third column presents 
T between the system predicted value of relative ERRG20 and observed 
relative ERR@20. 
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